{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using pre-trained embedding with Tensorflow Hub\n",
    "\n",
    "**Learning Objectives**\n",
    "1. How to instantiate a Tensorflow Hub module\n",
    "1. How to find pretrained Tensorflow Hub module for variety of purposes\n",
    "1. How to use a pre-trained TF Hub text modules to generate sentence vectors\n",
    "1. How to incorporate a pre-trained TF-Hub module into a Keras model\n",
    "\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "\n",
    "In this notebook, we will implement text models to recognize the probable source (Github, Tech-Crunch, or The New-York Times) of the titles we have in the title dataset.\n",
    "\n",
    "First, we will load and pre-process the texts and labels so that they are suitable to be fed to sequential Keras models with first layer being TF-hub pre-trained modules. Thanks to this first layer, we won't need to tokenize and integerize the text before passing it to our models. The pre-trained layer will take care of that for us, and consume directly raw text. However, we will still have to one-hot-encode each of the 3 classes into a 3 dimensional basis vector.\n",
    "\n",
    "Then we will build, train and compare simple models starting with different pre-trained TF-Hub layers."
   ]
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
    ],
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "!pip install --user google-cloud-bigquery==1.25.0"
    ],
    "execution_count": null,
    "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: Restart your kernel to use updated packages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kindly ignore the deprecation warnings and incompatibility errors related to google-cloud-storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from google.cloud import bigquery\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace the variable values in the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "BUCKET = PROJECT  # defaults to PROJECT\n",
    "REGION = \"us-central1\"  # Replace with your REGION\n",
    "SEED = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a Dataset from BigQuery \n",
    "\n",
    "Hacker news headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the sites inception in October 2006 until October 2015. \n",
    "\n",
    "Here is a sample of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "\n",
    "SELECT\n",
    "    url, title, score\n",
    "FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "    LENGTH(title) > 10\n",
    "    AND score > 10\n",
    "    AND LENGTH(url) > 0\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "\n",
    "SELECT\n",
    "    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "    COUNT(title) AS num_articles\n",
    "FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "    AND LENGTH(title) > 10\n",
    "GROUP BY\n",
    "    source\n",
    "ORDER BY num_articles DESC\n",
    "  LIMIT 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex = '.*://(.[^/]+)/'\n",
    "\n",
    "\n",
    "sub_query = \"\"\"\n",
    "SELECT\n",
    "    title,\n",
    "    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source\n",
    "    \n",
    "FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')\n",
    "    AND LENGTH(title) > 10\n",
    "\"\"\".format(regex)\n",
    "\n",
    "\n",
    "query = \"\"\"\n",
    "SELECT \n",
    "    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,\n",
    "    source\n",
    "FROM\n",
    "  ({sub_query})\n",
    "WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')\n",
    "\"\"\".format(sub_query=sub_query)\n",
    "\n",
    "print(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML however figures out on its own how to create these splits, so we won't need to do that here. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bq = bigquery.Client(project=PROJECT)\n",
    "title_dataset = bq.query(query).to_dataframe()\n",
    "title_dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AutoML for text classification requires that\n",
    "* the dataset be in csv form with \n",
    "* the first column being the texts to classify or a GCS path to the text \n",
    "* the last column to be the text labels\n",
    "\n",
    "The dataset we pulled from BiqQuery satisfies these requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The full dataset contains {n} titles\".format(n=len(title_dataset)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure we have roughly the same number of labels for each of our three labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "title_dataset.source.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we will save our data, which is currently in-memory, to disk.\n",
    "\n",
    "We will create a csv file containing the full dataset and another containing only 1000 articles for development.\n",
    "\n",
    "**Note:** It may take a long time to train AutoML on the full dataset, so we recommend to use the sample dataset for the purpose of learning the tool. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATADIR = './data/'\n",
    "\n",
    "if not os.path.exists(DATADIR):\n",
    "    os.makedirs(DATADIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FULL_DATASET_NAME = 'titles_full.csv'\n",
    "FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)\n",
    "\n",
    "# Let's shuffle the data before writing it to disk.\n",
    "title_dataset = title_dataset.sample(n=len(title_dataset))\n",
    "\n",
    "title_dataset.to_csv(\n",
    "    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see [here](https://cloud.google.com/natural-language/automl/docs/beginners-guide) for further details on how to prepare data for AutoML)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_title_dataset = title_dataset.sample(n=1000)\n",
    "sample_title_dataset.source.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write the sample datatset to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SAMPLE_DATASET_NAME = 'titles_sample.csv'\n",
    "SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)\n",
    "\n",
    "sample_title_dataset.to_csv(\n",
    "    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_title_dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import os\n",
    "import shutil\n",
    "\n",
    "import pandas as pd\n",
    "import tensorflow as tf\n",
    "from tensorflow.keras.callbacks import TensorBoard, EarlyStopping\n",
    "from tensorflow_hub import KerasLayer\n",
    "from tensorflow.keras.layers import Dense\n",
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.preprocessing.text import Tokenizer\n",
    "from tensorflow.keras.utils import to_categorical\n",
    "\n",
    "\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by specifying where the information about the trained models will be saved as well as where our dataset is located:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_DIR = \"./text_models\"\n",
    "DATA_DIR = \"./data\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in the previous labs, our dataset consists of titles of articles along with the label indicating from which source these articles have been taken from (GitHub, Tech-Crunch, or the New-York Times):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls ./data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATASET_NAME = \"titles_full.csv\"\n",
    "TITLE_SAMPLE_PATH = os.path.join(DATA_DIR, DATASET_NAME)\n",
    "COLUMNS = ['title', 'source']\n",
    "\n",
    "titles_df = pd.read_csv(TITLE_SAMPLE_PATH, header=None, names=COLUMNS)\n",
    "titles_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look again at the number of examples per label to make sure we have a well-balanced dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles_df.source.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this lab, we will use pre-trained [TF-Hub embeddings modules for english](https://tfhub.dev/s?q=tf2%20embeddings%20text%20english) for the first layer of our models. One immediate\n",
    "advantage of doing so is that the TF-Hub embedding module will take care for us of processing the raw text. \n",
    "This also means that our model will be able to consume text directly instead of sequences of integers representing the words.\n",
    "\n",
    "However, as before, we still need to preprocess the labels into one-hot-encoded vectors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "CLASSES = {\n",
    "    'github': 0,\n",
    "    'nytimes': 1,\n",
    "    'techcrunch': 2\n",
    "}\n",
    "N_CLASSES = len(CLASSES)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_labels(sources):\n",
    "    classes = [CLASSES[source] for source in sources]\n",
    "    one_hots = to_categorical(classes, num_classes=N_CLASSES)\n",
    "    return one_hots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "encode_labels(titles_df.source[:4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the train/test splits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's split our data into train and test splits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "N_TRAIN = int(len(titles_df) * 0.95)\n",
    "\n",
    "titles_train, sources_train = (\n",
    "    titles_df.title[:N_TRAIN], titles_df.source[:N_TRAIN])\n",
    "\n",
    "titles_valid, sources_valid = (\n",
    "    titles_df.title[N_TRAIN:], titles_df.source[N_TRAIN:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To be on the safe side, we verify that the train and test splits\n",
    "have roughly the same number of examples per class.\n",
    "\n",
    "Since it is the case, accuracy will be a good metric to use to measure\n",
    "the performance of our models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "sources_train.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "sources_valid.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create the features and labels we will feed our models with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, Y_train = titles_train.values, encode_labels(sources_train)\n",
    "X_valid, Y_valid = titles_valid.values, encode_labels(sources_valid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_train[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NNLM Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will first try a word embedding pre-trained using a [Neural Probabilistic Language Model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf). TF-Hub has a 50-dimensional one called \n",
    "[nnlm-en-dim50-with-normalization](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1), which also\n",
    "normalizes the vectors produced. \n",
    "\n",
    "Once loaded from its url, the TF-hub module can be used as a normal Keras layer in a sequential or functional model. Since we have enough data to fine-tune the parameters of the pre-trained embedding itself, we will set `trainable=True` in the `KerasLayer` that loads the pre-trained embedding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "NNLM = \"https://tfhub.dev/google/nnlm-en-dim50/2\"\n",
    "\n",
    "nnlm_module = KerasLayer(\n",
    "    NNLM, output_shape=[50], input_shape=[], dtype=tf.string, trainable=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [],
   "source": [
    "nnlm_module(tf.constant([\"The dog is happy to see people in the street.\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building the models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write a function that \n",
    "\n",
    "* takes as input an instance of a `KerasLayer` (i.e. the `nnlm_module` we constructed above) as well as the name of the model (say `nnlm`)\n",
    "* returns a compiled Keras sequential model starting with this pre-trained TF-hub layer, adding one or more dense relu layers to it, and ending with a softmax layer giving the probability of each of the classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_model(hub_module, name):\n",
    "    model = Sequential([\n",
    "        hub_module, # TODO \n",
    "        Dense(16, activation='relu'),\n",
    "        Dense(N_CLASSES, activation='softmax')\n",
    "    ], name=name)\n",
    "\n",
    "    model.compile(\n",
    "        optimizer='adam',\n",
    "        loss='categorical_crossentropy',\n",
    "        metrics=['accuracy']\n",
    "    )\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also wrap the training code into a `train_and_evaluate` function that \n",
    "* takes as input the training and validation data, as well as the compiled model itself, and the `batch_size`\n",
    "* trains the compiled model for 100 epochs at most, and does early-stopping when the validation loss is no longer decreasing\n",
    "* returns an `history` object, which will help us to plot the learning curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_and_evaluate(train_data, val_data, model, batch_size=5000):\n",
    "    X_train, Y_train = train_data\n",
    "\n",
    "    tf.random.set_seed(33)\n",
    "\n",
    "    model_dir = os.path.join(MODEL_DIR, model.name)\n",
    "    if tf.io.gfile.exists(model_dir):\n",
    "        tf.io.gfile.rmtree(model_dir)\n",
    "\n",
    "    history = model.fit(\n",
    "        X_train, Y_train,\n",
    "        epochs=100,\n",
    "        batch_size=batch_size,\n",
    "        validation_data=val_data,\n",
    "        callbacks=[EarlyStopping(), TensorBoard(model_dir)],\n",
    "    )\n",
    "    return history"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training NNLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = (X_train, Y_train)\n",
    "val_data = (X_valid, Y_valid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "nnlm_model = build_model(nnlm_module, 'nnlm')\n",
    "nnlm_history = train_and_evaluate(data, val_data, nnlm_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = nnlm_history\n",
    "pd.DataFrame(history.history)[['loss', 'val_loss']].plot()\n",
    "pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try to beat the best model by modifying the model architecture, changing the TF-Hub embedding, and tweaking the training parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
